Building a Corpus for Japanese Wikification with Fine-Grained Entity Classes
Abstract
In this research, we build a Wikification corpus to advance Japanese entity linking. The corpus consists of 340 Japanese newspaper articles containing 25,675 entity mentions. All entity mentions are labeled with fine-grained semantic classes (200 classes), and 19,121 mentions were successfully linked to Japanese Wikipedia articles. Even with the fine-grained semantic classes, we found it hard to define the target of entity linking annotations and to use the classes to improve the accuracy of entity linking.
Similar resources
Fine-grained Arabic named entity recognition
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task, which aims to extract useful information from unstructured textual data by detecting and classifying Named Entity (NE) phrases into predefined semantic classes. This thesis addresses the problem of fine-grained NER for Arabic, which poses unique linguistic challenges to NER; such as the absence of capitalisation and sho...
Automatically Developing a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia
This paper presents a methodology to exploit the potential of Arabic Wikipedia to assist in the automatic development of a large Fine-grained Named Entity (NE) corpus and gazetteer. The cornerstone of this approach is efficient classification of Wikipedia articles to target NE classes. The resources developed were thoroughly evaluated to ensure reliability and a high quality. Results show the ...
Low-Complexity Heuristics for Deriving Fine-Grained Classes of Named Entities from Web Textual Data
We introduce a low-complexity method for acquiring fine-grained classes of named entities from the Web. The method exploits the large amounts of textual data available on the Web, while avoiding the use of any expensive text processing techniques or tools. The quality of the extracted classes is encouraging with respect to both the precision of the sets of named entities acquired within various...
Wikification for Scriptio Continua
The fact that Japanese employs scriptio continua, or a writing system without spaces, complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliable identification of words in text. Although external lexical resources like Wikipedia are potentially useful, segmentation mismatch prevents them from bei...
Fine-grained Dutch named entity recognition
This paper describes the creation of a fine-grained named entity annotation scheme and corpus for Dutch, and experiments on automatic main type and subtype named entity recognition. We give an overview of existing named entity annotation schemes, and motivate our own, which describes six main types (persons, organizations, locations, products, events and miscellaneous named entities) and finer-...